Preliminary api client and parsl apps to export data and start training task #25
base: develop
Conversation
```python
    creation_time: datetime.datetime,
) -> str:
    """Generate URL for an export file generated by :meth:`document_export`"""
    # is there any way get current user id from api based purely on token?
```
# is there any way get current user id from api based purely on token?
Yes, the following API URL returns the current user only: https://test-htr.lib.princeton.edu/api/users/current/
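For reference, a minimal sketch of calling that endpoint from the client, assuming the usual DRF-style `Token` authorization header (the header scheme is my assumption, not something confirmed in this thread); pass the result to `requests.get`:

```python
# Hedged sketch: build the request for the users/current/ endpoint.
# The "Token" auth header scheme is assumed, not taken from API docs.
def current_user_request(base_url: str, token: str):
    """Return the URL and headers for looking up the current user by token."""
    url = f"{base_url.rstrip('/')}/users/current/"
    headers = {"Authorization": f"Token {token}"}
    return url, headers
```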
Ah, thank you! How did you know that? Is there other documentation / code I should be looking at to understand the API better?
(I see the method in the escriptorium view code now. I guess I'll have to remember to keep referencing the api views.)
Yes, I was viewing https://test-htr.lib.princeton.edu/api/users/ in the browser (mostly to check whether I was right to think that regular users would only ever see themselves, but admins would see everyone) when I noticed Current under the Extra Actions. So unfortunately there's no extra documentation I can point you to, but there are some helpful methods we can keep an eye out for!
Ooh! I hadn't noticed the Extra Actions yet! That is good to know about.
I see now that there's a documents tasks api endpoint, maybe it would be better to use that one for getting the export status...
```python
# WORKFLOW_STATE_ERROR = 2
# WORKFLOW_STATE_DONE = 3
# WORKFLOW_STATE_CANCELED = 4
# add a property?
```
Potentially. Something like WORKFLOW_STATE_CLUSTER to indicate that a task has been sent to the cluster? I'm assuming we will not be monitoring and passing data back and forth to such an extent as to track when a slurm task is still queued vs when it is running.
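Something like this, maybe. The `CLUSTER` value and any state names beyond the ERROR/DONE/CANCELED constants quoted above are assumptions for illustration, not eScriptorium's actual values:

```python
from dataclasses import dataclass

# State labels for the constants quoted above; 5/"cluster" is a
# hypothetical new state for tasks handed off to the HPC cluster.
WORKFLOW_STATE_LABELS = {
    2: "error",
    3: "done",
    4: "canceled",
    5: "cluster",  # hypothetical WORKFLOW_STATE_CLUSTER
}

@dataclass
class TaskStatus:
    workflow_state: int

    @property
    def state_label(self) -> str:
        """Human-readable name for the numeric workflow state."""
        return WORKFLOW_STATE_LABELS.get(self.workflow_state, "unknown")
```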
src/htr2hpc/api_client.py
Outdated
```python
resp = self._make_request(api_url)
return ResultsList(result_type="model", **resp.json())

def update_model(self, model_id: int, model_file: pathlib.Path):
```
This function is overwriting an existing model file in eScriptorium with the newly trained model. I think that should work for when the user clicks the "Overwrite existing model" checkbox. But otherwise, we will want to use a POST method to get a new and separate model into the user's model list.
When POSTing a new model, the required data for the upload is the model file, its `name`, and its `job` (Segment or Recognize). To display properly in eScriptorium we also want its `accuracy_percent`.
The latter is in the mlmodel file, under `kraken_meta` and `metrics`. It's going to be the last value in the accuracy list (since the mlmodel file tracks the accuracies of all prior epochs as well). I know you can access it via Python using something like the following:
```python
from kraken.lib import vgsl

m = vgsl.TorchVGSLModel.load_model(PATH_TO_MODEL)
m.user_metadata['accuracy'][-1][-1] * 100
```
or:
```python
import coremltools
import json

m = coremltools.models.MLModel(PATH_TO_MODEL)
meta = json.loads(m.get_spec().description.metadata.userDefined['kraken_meta'])
meta['accuracy'][-1][-1] * 100
```
Unfortunately, when I was testing API model imports and trying to make them display the accuracy, it was not working; I think that's because the "accuracy_percent" field is set to read_only in the eScriptorium API...
I think for POSTing a new model we'd want something like this; they're easy enough changes to make. The problem just remains that I can't get accuracy_percent to update. (The upload goes through, it just means there's no recorded accuracy in the model list, which is... not ideal.)
```python
def upload_model(self, model_file: pathlib.Path, model_name: str, model_job: str):
    """Upload the new model file to a user's list of models."""
    api_url = "models/"
    with open(model_file, "rb") as mfile:
        files = {"file": mfile}
        data = {
            "name": model_name,  # required; get from user
            "job": model_job,  # required; either "Segment" or "Recognize", get from user
            "accuracy_percent": 0.0,  # 0.0 by default; can be found in the mlmodel file
            # file_size (int): not required when POSTing with the API, filled in automatically
            # versions? does not seem to be used
        }
        resp = self._make_request(api_url, method="POST", files=files, data=data)
    # on successful upload, returns the model object
    return to_namedtuple("model", resp.json())
```
```python
    "file_format": "alto",
    "include_images": True,
    "region_types": block_types,
}
```
Easy enough to add `"parts": parts_list,` and to add `parts_list` as an input to the function -- if we are training on (and therefore exporting) the full doc we don't need it, but if we are only doing a selection of images then this will let us handle that.
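A quick sketch of what that would look like, with `parts_list` as the optional caller-supplied selection (function name is just for illustration):

```python
def build_export_payload(block_types, parts_list=None):
    """Build the document export request data; parts_list optionally
    restricts the export to a selection of document parts."""
    payload = {
        "file_format": "alto",
        "include_images": True,
        "region_types": block_types,
    }
    if parts_list:
        # only restrict the export when a selection of parts was given;
        # omitting the key exports the full document
        payload["parts"] = parts_list
    return payload
```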
```python
"""details for a single task"""
api_url = f"tasks/{task_id}/"
resp = self._make_request(api_url)
return to_namedtuple("task", resp.json())
```
We will also want functions to locate and download model files, if a user desires to train on top of a given model. For transcription, a user can either train on top of a given transcription model or from scratch. For segmentation, a user can either train on top of a given segmentation model or "from scratch", EXCEPT here "from scratch" actually means "on top of the base blla.mlmodel that comes as the default segmenter in kraken". So we need both a way of downloading a selected model and a way of passing the blla.mlmodel along with our data.
Fortunately, given a MODEL_PK, we can simply use BASE_URL/api/models/MODEL_PK/ and then get the URL to download the model from the "file" field. `requests.get()` should then work just fine.
For blla.mlmodel, we can just grab it from the kraken install, or store a copy somewhere.
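That flow might look roughly like this (a sketch only: the `Token` auth header is an assumption, and only the `file` field is taken from the comment above; stdlib `urllib` stands in for `requests`):

```python
import json
import pathlib
import urllib.request

def model_detail_url(base_url: str, model_pk: int) -> str:
    """URL for a single model record, per BASE_URL/api/models/MODEL_PK/."""
    return f"{base_url.rstrip('/')}/api/models/{model_pk}/"

def download_model(base_url: str, token: str, model_pk: int,
                   dest_dir: pathlib.Path) -> pathlib.Path:
    """Look up the model record, then fetch the file its "file" field points to."""
    headers = {"Authorization": f"Token {token}"}  # assumed auth scheme
    req = urllib.request.Request(model_detail_url(base_url, model_pk),
                                 headers=headers)
    with urllib.request.urlopen(req) as resp:
        record = json.load(resp)
    file_url = record["file"]
    dest = pathlib.Path(dest_dir) / file_url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(
        urllib.request.Request(file_url, headers=headers)
    ) as resp:
        dest.write_bytes(resp.read())
    return dest
```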
src/htr2hpc/train.py
Outdated
```python
    args.document_id,
    args.transcription,
)
```
Probably add logic like: if args.model exists, then run a prep_model() function taking that MODEL_PK as input and saving the model as `model_data` or something. That would handle both cases of segmentation and transcription training where the user chooses to fine-tune on an existing model.
To handle segmentation training "from scratch" (which is actually on top of the default blla.mlmodel), we would want logic like: if args.model does not exist and args.job == 'segmentation', then load in blla.mlmodel instead.
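The branching described above could be sketched like this (function name, return shape, and the blla path are all placeholders for illustration):

```python
def select_base_model(model_id, job, blla_path="blla.mlmodel"):
    """Decide which model (if any) training should start from.

    Returns ("download", pk) for a user-chosen model, ("local", path)
    for the default segmenter, or None for true from-scratch training.
    """
    if model_id is not None:
        # user chose a model to fine-tune on: fetch it via the API
        return ("download", model_id)
    if job == "segmentation":
        # segtrain "from scratch" really means on top of blla.mlmodel
        return ("local", blla_path)
    # transcription training genuinely from scratch
    return None
```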
src/htr2hpc/train_apps.py
Outdated
```python
# return training directory as a parsl file
return File(data_dir)
```
Should probably also have a `prep_model()` function, taking as input `es_base_url`, `es_api_token`, `model_id`.
src/htr2hpc/train_apps.py
Outdated
```python
@bash_app
def segtrain(inputs=[], outputs=[]):
```
We could put both transcription and segmentation training in one function, but we probably want a separate `train` function for transcription training. I'm laying out the usual eScriptorium train/segtrain commands as they would be formatted in command-line kraken below.

- A command for transcription training from scratch:

  ```
  ketos train --epochs 50 -o model --workers HOW_MANY_CPUS -d cuda:0 -f alto {input_data_dir}/*.xml
  ```

  (For workers, use the number of CPUs requested in the slurm job.)

- A command for transcription refining on another model:

  ```
  ketos train --epochs 50 --resize union -i {path_to_starting_model} -o model --workers HOW_MANY_CPUS -d cuda:0 -f alto {input_data_dir}/*.xml
  ```

- A command for segmentation training "from scratch":

  ```
  ketos segtrain --epochs 50 --resize both -i {path_to_blla_mlmodel} -o model --workers HOW_MANY_CPUS -d cuda:0 -f xml {input_data_dir}/*.xml
  ```

- A command for segmentation refining on another model:

  ```
  ketos segtrain --epochs 50 --resize both -i {path_to_starting_model} -o model --workers HOW_MANY_CPUS -d cuda:0 -f xml {input_data_dir}/*.xml
  ```
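Assembling those command strings for a parsl `bash_app` could look something like this sketch (function name and defaults are illustrative; the flags mirror the commands listed above):

```python
def build_train_command(job, input_dir, workers, base_model=None, epochs=50):
    """Assemble a ketos train/segtrain command line (sketch for a bash_app)."""
    if job == "segmentation":
        # segtrain always starts from a model: either the user's choice
        # or the default blla.mlmodel for "from scratch" training
        parts = [f"ketos segtrain --epochs {epochs} --resize both -i {base_model}"]
        file_format = "xml"
    else:
        parts = [f"ketos train --epochs {epochs}"]
        if base_model:
            parts.append(f"--resize union -i {base_model}")
        file_format = "alto"
    parts.append(
        f"-o model --workers {workers} -d cuda:0 "
        f"-f {file_format} {input_dir}/*.xml"
    )
    return " ".join(parts)
```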
src/htr2hpc/api_client.py
Outdated
```python
api_url = "users/current/"
return to_namedtuple("user", self._make_request(api_url).json())

def models(self):
```
@laurejt I've been thinking it would be better to differentiate related methods more clearly, e.g. `list_models`, `get_model`, etc. Would appreciate your thoughts.
Yes, that sounds like a good idea. Giving methods names that are more descriptive of their function should hopefully make it easier to use.
Maybe `get_model_list`?
src/htr2hpc/train_apps.py
Outdated
```python
data_dir = pathlib.Path("training_data")
data_dir.mkdir(exist_ok=True)
```
Seems reasonable enough, but just wanted to note that if there's any chance a plain *file* named `training_data` exists at this path, this will raise a FileExistsError -- `exist_ok=True` only suppresses the error for an existing directory.
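A quick demonstration of that caveat, using a throwaway temp directory:

```python
import pathlib
import tempfile

# mkdir(exist_ok=True) still raises FileExistsError when the target
# path exists as a regular file rather than a directory.
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "training_data"
    path.touch()  # create a plain file where the directory would go
    try:
        path.mkdir(exist_ok=True)
        raised = False
    except FileExistsError:
        raised = True
```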
src/htr2hpc/api_client.py
Outdated
```python
}

def to_namedtuple(name: str, data: any):
```
Perhaps move this to the top of the file, since `ResultsList` also depends on `to_namedtuple`.
```python
from collections import namedtuple
from dataclasses import dataclass
from time import sleep
import pathlib
```
Perhaps only import `Path`:

```diff
- import pathlib
+ from pathlib import Path
```
```python
# this will match the completion time of the task...
base_filename = "export_doc%d_%s_%s_%s" % (
    document_id,
    slugify(document_name).replace("-", "_")[:32],
```
Can you explain this in more detail? Using `slugify` seems like overkill; at the very least it's worth confirming which characters are being removed and/or replaced.
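For reference, roughly what the slugify-then-replace step does to a document name. This is a pure-Python approximation of Django's ASCII `slugify` for illustration, not the exact implementation eScriptorium runs:

```python
import re

def slugify_ascii(value: str) -> str:
    """Rough stand-in for django.utils.text.slugify (ASCII case):
    lowercase, drop punctuation, collapse spaces/dashes to single dashes."""
    value = re.sub(r"[^\w\s-]", "", value.lower())
    return re.sub(r"[-\s]+", "-", value).strip("-_")

# as in the export filename: slugify, swap dashes for underscores, truncate
doc_slug = slugify_ascii("My Test: Document #3").replace("-", "_")[:32]
```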
```python
# import.export.BaseExporter logic
# NOTE2: escriptorium code uses datetime.now() so there's no guarantee
# this will match the completion time of the task...
base_filename = "export_doc%d_%s_%s_%s" % (
```
Consider replacing with an f-string
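Something like this, say. The last two format arguments are cut off in the excerpt above, so `part3`/`part4` are placeholders for whatever they are:

```python
def export_filename(document_id: int, doc_slug: str, part3, part4) -> str:
    """The %-formatted base_filename above rewritten as an f-string.

    doc_slug is the already-slugified document name; part3/part4 stand in
    for the two fields not shown in the excerpt.
    """
    return f"export_doc{document_id}_{doc_slug[:32]}_{part3}_{part4}"
```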
```python
# TODO: confirm correct task id based on method (export) and document name
export_task = task_list.results[0]
```
Do you only need the first item from the results list?
```python
pk: int
document: int
document_part: int
workflow_state: int
label: str
messages: str
queued_at: datetime.datetime
started_at: datetime.datetime
done_at: datetime.datetime
method: str
user: int
```
Perhaps worth some comments, especially for the cases where these are identifiers rather than contents or filenames.
It seems like it might be worth renaming `document` to `document_id`.
src/htr2hpc/train.py
Outdated
```python
parser.add_argument(
    "-p", "--parts", help="Optional list of parts to export (if not entire document)"
)
```
I'm guessing this component hasn't been worked out yet, but I suspect it may use `nargs="+"`.
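That would let multiple part ids be passed on the command line, e.g. `-p 12 15 18` (adding `type=int` here for illustration):

```python
import argparse

# sketch of the --parts option with nargs="+" so several ids can be given
parser = argparse.ArgumentParser()
parser.add_argument(
    "-p", "--parts", nargs="+", type=int,
    help="Optional list of parts to export (if not entire document)",
)
args = parser.parse_args(["-p", "12", "15", "18"])
```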
Co-authored-by: Laure Thompson <[email protected]>
in relation to #24